Make IFBench eval tolerate missing or whitespace-shifted responses by resolvicomai · Pull Request #27 · allenai/IFBench

resolvicomai · 2026-05-20T22:15:22Z

Summary

- make prompt-to-response lookup tolerant of leading/trailing prompt whitespace
- score missing model responses as failed examples instead of raising KeyError
- create the eval output directory automatically
- add regression coverage for whitespace drift and missing responses
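
The first three changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (`build_response_index`, `lookup_response`, `score_example`, `ensure_output_dir`) are hypothetical, and the `"prompt"` / `"response"` field names are an assumption about the JSONL row layout.

```python
import os
from typing import Callable, Dict, List, Optional


def build_response_index(rows: List[dict]) -> Dict[str, str]:
    """Map whitespace-stripped prompts to responses.

    Assumes each row has "prompt" and "response" keys (hypothetical layout).
    """
    return {row["prompt"].strip(): row["response"] for row in rows}


def lookup_response(index: Dict[str, str], prompt: str) -> Optional[str]:
    """Return the response for a prompt, or None if no generation exists."""
    # dict.get avoids the KeyError that index[prompt] would raise.
    return index.get(prompt.strip())


def score_example(
    prompt: str,
    index: Dict[str, str],
    checker: Callable[[str], bool],
) -> bool:
    """Score one example; a missing response counts as a failure."""
    response = lookup_response(index, prompt)
    if response is None:
        return False
    return bool(checker(response))


def ensure_output_dir(path: str) -> str:
    """Create the output directory if needed; a no-op when it exists."""
    os.makedirs(path, exist_ok=True)
    return path
```

Stripping on both the index side and the lookup side means a trailing newline in either file no longer changes the key, and returning `None` lets the caller record a failed row and keep going.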

Why

The current sample evaluation can fail before scoring because data/sample_output.jsonl has prompts with trailing whitespace and fewer rows than data/IFBench_test.jsonl. For eval and RLVR-style reward loops, a missing generation should produce a failed score, not abort the whole run.
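
To see how the two files disagree before running the evaluator, a small diagnostic along these lines may help. It is a sketch under assumptions: `read_jsonl` and `diagnose_prompt_mismatch` are hypothetical helpers, not part of the repository, and the `"prompt"` key is an assumed field name.

```python
import json
from typing import Dict, Iterable, List


def read_jsonl(path: str) -> List[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def diagnose_prompt_mismatch(
    input_prompts: Iterable[str],
    response_prompts: Iterable[str],
) -> Dict[str, int]:
    """Count how each input prompt matches the response file's prompts.

    "exact": identical string found in the responses.
    "whitespace_only": found only after stripping both sides.
    "missing": no response prompt matches even after stripping.
    """
    exact = set(response_prompts)
    stripped = {p.strip() for p in exact}
    counts = {"exact": 0, "whitespace_only": 0, "missing": 0}
    for prompt in input_prompts:
        if prompt in exact:
            counts["exact"] += 1
        elif prompt.strip() in stripped:
            counts["whitespace_only"] += 1
        else:
            counts["missing"] += 1
    return counts
```

With the files named above, this could be called as `diagnose_prompt_mismatch((r["prompt"] for r in read_jsonl("data/IFBench_test.jsonl")), (r["prompt"] for r in read_jsonl("data/sample_output.jsonl")))`, assuming both files use a `"prompt"` key. Nonzero `whitespace_only` or `missing` counts correspond to the two failure modes described here.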

Context: Prime Intellect IF-RLVR/Bench Algora bounty: https://algora.io/PrimeIntellect-ai/bounties/dderbjHtPwTiGVY4

Validation

uv run pytest -q
rm -rf /tmp/ifbench-eval && uv run python -m run_eval --input_data=data/IFBench_test.jsonl --input_response_data=data/sample_output.jsonl --output_dir=/tmp/ifbench-eval

/claim https://algora.io/PrimeIntellect-ai/bounties/dderbjHtPwTiGVY4

resolvicomai · 2026-05-21T01:35:31Z

/claim https://algora.io/PrimeIntellect-ai/bounties/dderbjHtPwTiGVY4

Scope note: this PR targets the evaluator-reliability path for the IF-RLVR/Bench bounty. It is independent from trainer/reward-helper integrations such as #28: it makes the existing eval path runnable on bundled sample outputs, normalizes prompt whitespace, and treats missing generations as failed rows instead of aborting the run.

resolvicomai · 2026-05-22T15:18:39Z

Friendly bounty-review ping: this PR targets the evaluator-reliability path for the IF-RLVR/Bench bounty. It keeps the scope small by making the existing eval path complete on bundled sample outputs, normalizing prompt whitespace, and scoring missing generations as failed rows instead of aborting. Happy to adjust if the sponsor wants a different slice for the bounty.

Handle missing IFBench eval responses

8de1a75

resolvicomai wants to merge 1 commit into allenai:main from resolvicomai:codex/robust-eval-prompt-matching
